Abstract — Web server logs hold abundant information about 
the users who access a site. Analyzing a user's current 
interest from navigational behavior can help organizations 
guide users in their browsing activity and let them obtain 
relevant information in a shorter span of time [1]. Web usage 
mining is used to discover interesting user navigation 
patterns and can be applied to many real-world problems, 
such as improving Web sites/pages, making additional topic 
or product recommendations, and user/customer behavior 
studies [23]. Web usage mining, in conjunction with standard 
approaches to personalization, helps to address some of the 
shortcomings of these techniques, including reliance on 
subjective user ratings, lack of scalability, poor 
performance, and sparse data [2, 3, 4, 5, 6]. However, 
discovering patterns from usage data is not sufficient for 
performing the personalization tasks. It is necessary to 
derive good-quality aggregate usage profiles, which in turn 
help to devise efficient recommendations for web 
personalization [11, 12, 13]. Unsupervised and competitive 
learning algorithms have also helped to cluster users 
efficiently by access pattern through mining web logs 
[19, 20, 24]. 

This paper presents and experimentally evaluates a 
technique for fine-tuning user clusters, formed from similar 
web access patterns in the usage profiles, by approximating 
them with a least square approach. Each cluster contains 
users with similar browsing patterns. These clusters are 
useful in web personalization, so that a site communicates 
better with its users. Experimental results indicate that 
using the generated aggregate usage profiles, with clusters 
approximated through the least square approach, personalizes 
effectively at early stages of a user's visit to a site, 
without deeper knowledge about the user. 

Index Terms — Aggregate Usage Profile, Least 
Square Approach, Web Personalization, Recommender 
Systems, Expectation Maximization. 

I. Introduction 

The tremendous growth of unstructured information available 
on the internet and e-commerce sites makes it very difficult to 
access relevant information quickly and efficiently. Web data 
research has encountered many challenges, such as scalability, 
multimedia and temporal issues. Web users drown in this volume 
of information and face the problem of information overload. 
Web personalization is the process of using a user's past 
behavior to provide relevant content, drawn from the large 
volume of web page links, relevant data, products, etc. 
Traditionally, the collaborative filtering technique was 
employed for this task. It generally produces 
recommendations on objects not yet rated by the user, by 
matching the current user's ratings for objects with those of 
similar users. To increase the user click rate and service 
quality of a specific website, the Web developer or designer 
needs to know what the user really wants to do and what 
interests them, so as to customize web pages for the user by 
learning their navigational pattern. Various approaches have 
been defined to apply such techniques and obtain better and 
more accurate recommendations as the user browses. 

II. Overview Of Related Work 

Various approaches have been devised for recommender 
systems [2, 3, 4, 5]. Explicit feedback from the user, or ratings 
on items, helps to match interests through online clustering of 
users with "similar interest" to provide recommendations. In 
practice, however, this leads to limitations of scalability and 
performance [6] due to the lack of sufficient user information. 
Other approaches related to usage mining discover patterns or 
usage profiles from implicit feedback such as users' page 
visits. Offline pattern discovery with numerous data mining 
techniques is used to provide dynamic recommendations based on 
the user's short-term interest. "WebPersonalizer", a 
usage-based Web personalization system that uses Web mining 
techniques to provide dynamic recommendations, was proposed in 
[6]. In [7], a novel approach using the LCS algorithm improves 
the quality of the recommendation system's predictions by 
classifying user navigation patterns. K-means clustering 
followed by classification for recommender systems [8] is used 
to predict future navigations and has improved the accuracy of 
predictions. Recent developments for online personalization 
through usage mining have also been proposed. In [2], two 
techniques, PACT and ARHP, based on the clustering of user 
transactions and pageviews respectively, were evaluated 
experimentally for the discovery of usage profiles. In [9], 
Poisson parameters used to determine the recommendation scores 
helped to focus on discovering the user's interest in a session 
with a clustering approach; the scores are used to recommend 
pages to the user. The approach in [9], which integrates 
clustering, association rules and Markov models, improved web 
page prediction accuracy. Various clustering algorithms, such as 
K-means, Fuzzy C-means and Subtractive Clustering, have helped 
to group user sessions [16, 17, 23, 24]. The clusters formed by 
applying these algorithms are aggregated to form web profiles. 
The recommendation engine uses these profiles to generate pages 
for recommendation. In [21, 22], the Formal Concept Analysis 
approach is used to discover user access patterns, represented 
as association rules, from web logs; these can then be used for 
personalization and recommendation. However, existing systems do 
not satisfy users in terms of the quality of recommendations, 
particularly on large web sites. This paper proposes to classify 
user navigation patterns through a web usage mining system and 
to provide online recommendations effectively. Results of tests 
on the ritindia.edu dataset indicate an improvement in the 
quality of the system's recommendations. 

III. Methodology 

Personalization using usage mining consists of four basic 
stages: 

1) Data Preprocessing 

2) Pattern discovery 

3) Pattern recognition 

4) Recommendation process 

Classifying and matching an online user based on his 
browsing interests, in order to recommend unvisited pages, 
is done in this paper using usage mining to determine the 
interest of "similar" users. The recommendation system [10] 
consists of an offline component and an online component. 
The offline component involves Data Preprocessing, Pattern 
Discovery and Pattern Analysis. The outcome of the offline 
component is the derivation of aggregate usage profiles using 
web usage mining techniques. The online component is 
responsible for matching the current user's profile to the 
aggregate usage profiles to generate the necessary 
recommendations. 

i. Data Preprocessing : 

Preprocessing is the primary task of personalization, 
involving cleansing of data, session and user identification, 
and pageview and transaction identification [10, 14, 15]. Let 
there be a set of pages $P = \{p_1, p_2, p_3, \ldots, p_n\}$ and a set of n 
sessions $S = \{s_1, s_2, s_3, \ldots, s_n\}$, where each $s_i \in S$ is a subset of P. 
A file consisting of the session profile of user requests for 
pages is maintained. As in [1], a session-pageview matrix is 
maintained, built from the sequence of page requests in each 
session. Each row represents a session, and each column holds 
the frequency of occurrence of a pageview in that session. The 
weight of a pageview is then determined by evaluating the 
importance of the page in terms of the ratio of the frequency 
of visits to the page to the overall page visits in the 
session, and is represented by a weighted session-pageview 
matrix. Each session s is modeled as a vector over the 
n-dimensional space of pageviews. 
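
The weighting step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `weighted_session_pageview_matrix` and the three sample sessions are assumptions made for the example, and pages are identified by integer category ids.

```python
from collections import Counter

def weighted_session_pageview_matrix(sessions, pages):
    """Return one weight vector per session, ordered like `pages`.

    Each weight is the number of visits to a page in the session divided
    by the total number of page visits in that session.
    """
    matrix = []
    for session in sessions:
        counts = Counter(session)   # frequency of each pageview in the session
        total = len(session)        # overall page visits in the session
        row = [counts[p] / total if total else 0.0 for p in pages]
        matrix.append(row)
    return matrix

# Hypothetical sessions over 13 page categories (illustrative values only).
sessions = [[1, 8], [1, 8, 13], [1, 13, 1, 13, 1, 7, 1]]
pages = list(range(1, 14))
matrix = weighted_session_pageview_matrix(sessions, pages)
```

Each row of `matrix` sums to 1, so a row can be read as the share of the session's visits that went to each page category.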
ii. Pattern Discovery : 

The primary task of pattern discovery is to find 
hidden patterns using various mining techniques such 
as clustering, association rules, classification etc., which 
help to uncover user behavior with respect to the site. It is 
an offline task which helps to determine sessions with similar 
navigational patterns/interest from the user session file. 
A clustering technique is employed to determine the session 
clusters using model-based Expectation Maximization, as in 
[1]. The profile interest is learnt by determining an aggregate 
usage profile using the formula: 

$$weight(p_i, c) = \frac{1}{n_c} \sum_{s \in c} w_{i,s} \tag{1}$$

where $w_{i,s}$ represents the weight of page $p_i$ in session $s \in c$ 
and $n_c$ represents the number of sessions in cluster c. Table 1 
shows the aggregate usage profiles for 6 clusters under 13 
distinct categories of pageview URLs (explained in Section IV). 
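
As a hedged sketch of how an aggregate usage profile could be computed, the code below averages the weighted session vectors assigned to each cluster. It assumes the profile weight of a page is the mean of its weights over the cluster's sessions, which is how the definitions above read; the cluster labels would come from the clustering step and are simply given here. The function name and sample values are hypothetical.

```python
def aggregate_usage_profiles(weighted_rows, labels):
    """Average the session weight vectors of each cluster.

    weighted_rows: equal-length weight vectors, one per session.
    labels: cluster id assigned to each session (assumed given).
    Returns {cluster_id: mean weight vector}.
    """
    sums, counts = {}, {}
    for row, label in zip(weighted_rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        # Accumulate page weights and the number of sessions per cluster.
        sums[label] = [a + b for a, b in zip(sums[label], row)]
        counts[label] += 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}

# Illustrative weighted sessions over three pages, in two clusters.
rows = [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5], [0.0, 0.0, 1.0]]
labels = [0, 0, 1]
profiles = aggregate_usage_profiles(rows, labels)
```

With these sample values, cluster 0 averages two sessions and cluster 1 contains a single session, so its profile equals that session's vector.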

TABLE I. Aggregate Usage Profiles 

iii. Pattern Recognition : 

The effectiveness of an individual profile is measured using 
the weighted average visit percentage, which represents the 
significance of the user's interest in the cluster. If the 
aggregate usage profiles consist of m clusters and k pages, 
then the significance in the cluster can be determined as follows: 



where $I(j)$ is the index of the maximum value for each page 
and $M_{I(j)}$ represents that maximum value. The weight of a 
page is also considered for each pageview j. This maximization 
function is used to recommend pages to users belonging to 
a profile/cluster. 
iv. Recommendation process : 

The recommendation engine is the online component of a 
usage-based personalization system. The goal of 
personalization based on anonymous Web usage data is to 
compute a recommendation set for the current (active) user 
session, consisting of the objects (links, ads, text, products, 
etc.) that most closely match the current user profile. 
The recommendation set can represent a long-term or short-term 
view of the user's navigational history, depending on the 
capability to track users across visits. The test data of user 
sessions are taken as sequences of pages in time order. An 
active window size is fixed, and that many pages are taken from 
the user session as the active session. The similarity between 
the active session and every cluster profile is then calculated 
using a vector similarity measure, and the most similar profile 
is selected for recommendations [1, 2, 25]. If an active session 
is represented as $s_j$ and a cluster as $c_k$, then their 
similarity can be measured as follows: 

$$sim(s_j, c_k) = \frac{\sum_i w_{i,j}\, w_{i,k}}{\sqrt{\sum_i w_{i,j}^2}\, \sqrt{\sum_i w_{i,k}^2}} \tag{3}$$

where $w_{i,j}$ represents the weight of page i in active session j and 
$w_{i,k}$ represents the weight of page i in cluster k. The method of 
least squares assumes that the best-fit similarity of a given 
type is the matching score that has the minimal sum of squared 
deviations (least square error) from a given set of 
data. Suppose that the data points are $sim(s_1, c_1), sim(s_2, c_2), \ldots, 
sim(s_j, c_k)$, where s is the independent variable and c is the 
dependent variable. The fitting score $f(s)$ has a deviation 
(error) d from each data point, i.e., $d_1 = c_1 - f(s_1)$, $d_2 = c_2 - f(s_2)$, 
..., $d_n = c_n - f(s_n)$. According to the method of least squares, 
the best matching score has the property that: 

$$sim'(s_j, c_k) = sim(s_j, c_k) - EE \tag{5}$$
This least square error approach helps to fine-tune the 
scores so as to approximate user clusters based on similar 
web access patterns in their usage profiles. Then a 
recommendation score for each pageview p in the selected 
cluster/profile is calculated. If C is the most similar cluster/ 
profile to the active session S, then the recommendation score 
for each pageview p in C is: 

$$Rec(S, p) = \sqrt{weight(p, C) \times sim'(S, C)} \tag{6}$$

Profiles having a similarity greater than a threshold value 
H are selected as matching clusters, in decreasing order 
of their scores. The weight of pageview p in C is counted 
twice, i.e. directly and indirectly; to compensate for this, 
the square root is taken in the above function, and results are 
normalized to a value between 0 and 1. If the pageview p is in 
the current active session, then its recommendation value is 
set to zero. These matching clusters can be used for 
instantly recommending pages which have not been visited 
by the user. 
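
The scoring rule described above can be illustrated with a short Python sketch. It assumes the score of a page is the square root of the product of its profile weight and the (adjusted) similarity between the active session and the profile, as the mention of the square root suggests; the function names, the `min_score` cutoff and the sample profile are hypothetical.

```python
import math

def recommendation_scores(profile, similarity, active_pages):
    """Score each page of the selected profile for the active session.

    profile: {page: weight of the page in the cluster profile}, weights in [0, 1].
    similarity: matching score between the active session and the profile, in [0, 1].
    active_pages: pages already present in the active session.
    """
    scores = {}
    for page, weight in profile.items():
        if page in active_pages:
            scores[page] = 0.0  # already visited: never recommended
        else:
            scores[page] = math.sqrt(weight * similarity)
    return scores

def recommend(profile, similarity, active_pages, min_score=0.0):
    """Return unvisited pages ordered by decreasing recommendation score."""
    scores = recommendation_scores(profile, similarity, active_pages)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [page for page, score in ranked if score > min_score]

# Illustrative profile weights for four page categories.
profile = {1: 0.64, 2: 0.25, 3: 0.09, 8: 0.0}
scores = recommendation_scores(profile, similarity=1.0, active_pages={1})
recs = recommend(profile, similarity=1.0, active_pages={1})
```

Because both factors lie between 0 and 1, the square root keeps every score in the same range, and the page already in the active session receives a score of zero regardless of its weight.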

The following figure shows the overall process of web 
personalization using web usage data. 
Figure 1. THE OVERALL PROCESS OF PERSONALIZATION 

IV. Experimental Evaluation 

This section provides a detailed experimental evaluation of 
the profile generation techniques. The privately available data 
set at the University of Shivaji, containing web log files of the 
ritindia.edu web site, has been used for this research. It 
includes the page visits of users who visited the "ritindia.edu" 
web site in the period from June 2006 to April 2011. The initial 
log file produced a total of 16,233 transactions, and the total 
number of URLs representing pageviews was 27. Using support 
filtering for long transactions, pageviews appearing in less 
than 0.5% or more than 85% of transactions were eliminated. 
Also, short transactions with fewer than 5 references were 
eliminated. The visits are recorded at the level of URL category 
and in time order, and cover 13 major distinct 
categories of pageview URLs. Each sequence in the dataset 
corresponds to a user's request for a page. 
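
A minimal sketch of the filtering step follows, under two stated assumptions: a "short" transaction is read as one with fewer than five references, and pageview filtering is applied before the length check. Neither the order of the two steps nor the function name `support_filter` comes from the paper, and the sample transactions are invented for illustration.

```python
def support_filter(transactions, min_support=0.005, max_support=0.85, min_length=5):
    """Drop rare/ubiquitous pageviews, then drop short transactions."""
    total = len(transactions)
    # Count the transactions in which each pageview appears at least once.
    doc_freq = {}
    for t in transactions:
        for page in set(t):
            doc_freq[page] = doc_freq.get(page, 0) + 1
    # Keep pageviews whose support lies within [min_support, max_support].
    keep = {p for p, n in doc_freq.items()
            if min_support <= n / total <= max_support}
    filtered = [[p for p in t if p in keep] for t in transactions]
    # Assumption: transactions with fewer than `min_length` references are "short".
    return [t for t in filtered if len(t) >= min_length]

# Illustrative transactions: "home" occurs everywhere and is filtered out.
transactions = [["home", "a", "b", "c", "d", "e"] for _ in range(8)]
transactions += [["home", "x"], ["home", "y", "x"]]
result = support_filter(transactions)
```

In this toy input, "home" exceeds the 85% ceiling and is removed, after which the last two transactions fall below five references and are dropped.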
The 13 categories are: 
A clustering model is estimated using approximately 15,000 
samples from the dataset. The dataset is split into 70% training 
and 30% testing sets, such that the model is built using the 
training set and then evaluated on the test samples for 
performance. Applying the Expectation Maximization clustering 
algorithm with 12 iterations results in 6 major 
clusters. Each cluster represents several sessions of 
navigational patterns with "similar" interest in the 
web pages, i.e. a usage profile, and the aggregate usage 
profile is determined using Equation (1). During the online 
phase, the pages visited in a session are stored in a user 
session file and, after each page visit, the relative frequency 
of pageviews in the active session is determined. An active 
session with sliding window size 'n' (in our experiment, the 
size is 5, as it represents the average number of page visits in 
the dataset) consists of the current page visit and the most 
recent n-1 pages visited. The window slides as the user 
browses through various pages. Using the cosine 
similarity measure, the active session is then matched with the 
aggregate usage profiles, and the matching cluster(s) with a value 
greater than the threshold are used for recommending 
pages that exceed the threshold and have not been visited by the 
user. 
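
The online matching step can be sketched as follows: the active session's weight vector is compared with each aggregate usage profile by cosine similarity, and profiles above a threshold are kept in decreasing order of score. The profile vectors, the threshold of 0.5 and the function names are illustrative assumptions, not values from the experiment.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

def matching_clusters(active, profiles, threshold):
    """Return (cluster_id, similarity) pairs above the threshold, best first."""
    scored = [(cid, cosine_similarity(active, prof))
              for cid, prof in profiles.items()]
    matches = [(cid, s) for cid, s in scored if s > threshold]
    return sorted(matches, key=lambda cs: cs[1], reverse=True)

# Hypothetical profiles over three pages and one active-session vector.
profiles = {1: [1.0, 0.0, 0.0], 2: [0.6, 0.8, 0.0], 3: [0.0, 0.0, 1.0]}
active = [0.5, 0.5, 0.0]
matches = matching_clusters(active, profiles, threshold=0.5)
```

Here cluster 2 ranks first because its profile spreads weight over the same two pages as the active session, while cluster 3 shares no pages with it and is excluded.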

For example, consider the following 3 sessions consisting of 
page visits: 

1 8 

1 8 13 

1 13 1 13 1 7 1 

The sliding window consists of the pages 1 13 1 13 1 at the 
fifth page visit. 
Tables 2 and 3 present the page visits and the frequency of 
visited pages in the sliding window. The weight of 
a pageview is obtained by evaluating the importance of the page 
in terms of the ratio of the frequency of visits to the page 
to the overall page visits in the active session. 
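
To make the sliding window concrete, the sketch below lists the active session after each page visit of the third demo session and computes the relative frequency of each page in a window. The helper names `active_sessions` and `pageview_weights` are hypothetical; the window size of 5 is the one stated in the text.

```python
from collections import Counter

def active_sessions(session, window_size):
    """Yield the active session (most recent pages) after each page visit."""
    for i in range(1, len(session) + 1):
        yield session[max(0, i - window_size):i]

def pageview_weights(active):
    """Relative frequency of each page within the active session."""
    counts = Counter(active)
    return {page: count / len(active) for page, count in counts.items()}

# Third demo session from the text, with the window size used in the experiment.
session3 = [1, 13, 1, 13, 1, 7, 1]
windows = list(active_sessions(session3, window_size=5))
weights_at_fifth_visit = pageview_weights(windows[4])
```

At the fifth visit the window holds pages 1 13 1 13 1, so page 1 carries weight 3/5 and page 13 carries weight 2/5; from the sixth visit on, the oldest page drops out of the window.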



TABLE II. Page visits in the sliding window 

Session   Order of visit   Window   Active Session [Page Visits] 
1         1                1        1 
1         2                2        1 8 
2         1                1        1 
2         2                2        1 8 
2         3                3        1 8 13 
3         1                1        1 
3         2                2        1 13 
3         3                3        1 13 1 
3         4                4        1 13 1 13 
3         5                5        1 13 1 13 1 
3         6                6        13 1 13 1 7 
3         7                7        1 13 1 7 1 



TABLE III. FREQUENCY OF PAGES VISITED / WEIGHTED PAGEVIEW 

TABLE IV. MATCHING CLUSTERS 

Clusters with a similarity greater than the threshold value are 
chosen as matching clusters, as shown in Table 4. This table 
presents a comparative study of the aggregate usage profiles and 
the maximization function used to produce recommendations. It has 
been found that when the user visits page 1 (window size 1), the 
appropriate clusters exceeding the threshold value are clusters 
2 and 4, and pages 2, 3, 6, 7 and 9 can be recommended 
from clusters 2 and 4. Similarly, when the user visits page 8 
subsequent to page visit 1 (window size 2), the appropriate 
matching clusters are cluster 2, cluster 3 and cluster 4. As the 
window size increases to the fixed size limit (n=5), the 
matching clusters for the visited page(s) in the active 
session, and hence the recommendations, change dynamically. 
Table 5 shows the recommended set of pages for all 
3 demo sessions. 



TABLE V. RECOMMENDATION SET FOR SESSION 1, SESSION 2, 
SESSION 3 

In the experimental study in [1], similarities between the 
demo sessions and the clusters are computed with Equation (3), 
and pages are then recommended as per Equation (6) using these 
unadjusted cluster similarities, which gives lower precision in 
the matching scores of clusters. In contrast, the least square 
analysis in our study finds the deviations and uses them to 
further refine the recommendations of Equation (6). When the 
above 3 demo sessions are evaluated with the approach in [1], 
the recommendation set varies considerably and recommends fewer 
pages under the given threshold. 

Conclusion and Future work 

The ability to collect detailed usage data at the level of 
individual mouse clicks provides Web-based companies with 
a tremendous opportunity for personalizing the Web 
experience of clients. The practicality of employing Web usage 
mining techniques for personalization is directly related to 
the discovery of effective aggregate profiles that can 
successfully capture relevant user navigational patterns and 
can be used as part of a usage-based recommender system to 
provide real-time personalization. 

In this work, the primary objective was to classify and match 
an online user based on his browsing interests. Identifying 
the current interests of the user from short-term 
navigational patterns, instead of explicit user information, has 
proved to be a promising source for recommending 
pages. In the particular context of anonymous usage data, this 
work under least square approximation shows promise in 
creating effective personalization solutions that can help retain 
and convert unidentified visitors based on their activities in 
the early stages of their visits. Future work involves 
various types of transactions derived from user sessions, 
such as isolating specific types of "content" pages in the 
recommendation process. The plan is also to 
incorporate client-side agents and to use optimization 
techniques to assess the quality of recommendations. 